Machine learning-assisted rapid determination for traditional Chinese Medicine Constitution

The aim of this study was to develop a machine learning-assisted rapid determination methodology for traditional Chinese Medicine Constitution. Based on the Constitution in Chinese Medicine Questionnaire (CCMQ), the most applied diagnostic instrument for assessing individuals’ constitutions, we employed automated supervised machine learning algorithms (i.e., Tree-based Pipeline Optimization Tool; TPOT) on all the possible item combinations for each subscale and an unsupervised machine learning algorithm (i.e., variable clustering; varclus) on the whole scale to select items that can best predict body constitution (BC) classifications or BC scores. By utilizing subsets of items selected based on TPOT and corresponding machine learning algorithms, the accuracies of BC classifications prediction ranged from 0.819 to 0.936, with the root mean square errors of BC scores prediction stabilizing between 6.241 and 9.877. Overall, the results suggested that the automated machine learning algorithms performed better than the varclus algorithm for item selection. Additionally, based on an automated machine learning item selection procedure, we provided the top three ranked item combinations with each possible subscale length, along with their corresponding algorithms for predicting BC classification and severity. This approach could accommodate the needs of different practitioners in traditional Chinese medicine for rapid constitution determination. Supplementary Information The online version contains supplementary material available at 10.1186/s13020-024-00992-0.


Introduction
The Yellow Emperor's Classic of Medicine introduced the concept of "weibing, " which refers to the state preceding the onset of disease, essentially describing a condition of subhealth.Within conventional medicine paradigms, the state between health and disease is referred to as the "third state".However, the scope of the "third state" is broad, and its mechanisms are unclear, which complicates targeted intervention strategies.In traditional Chinese medicine (TCM), there are systematic theories and intervention methods specifically designed to address subhealth issues.One of the common theoretical frameworks in TCM for subhealth is described using body constitution (BC), which is relatively stable across an individual's lifespan.is assumed, these traditional scoring methods often overlook the need to account for the relative weights of different items.To further improve the efficiency of the determination of BC classifications and scores, this study introduces machine learning-assisted methods for their rapid determination.These methods enable automatic classifications of BC and calculation of BC scores using selected subsets of CCMQ items.Utilizing both linear and nonlinear algorithms, machine learning-based approaches incorporate the relative weights of items, thereby refining the scoring and classification processes for BC types.
Machine learning methodologies can be broadly categorized into two primary types: supervised and unsupervised machine learning.During the process of supervised machine learning-assisted rapid determination, items are utilized as predictors, and each BC classification or score is used as the target to construct predictive models.A subset of core items that contribute most to the predicted outcome are chosen in the training process, and then they are used as predictors to obtain the outcome (i.e., BC classification or score) in the testing process.Common supervised machine learning algorithms for item selection include support vector machines, elastic net, extreme gradient boosting, gradient boosting, k-nearest neighbors, random forests, and extremely randomized trees.These algorithms have been proven effective in various studies [16,17] and are considered potential supervised learning options in this study.In contrast, unsupervised machine learning does not involve predefined targets but instead aims to discover intrinsic data patterns and groupings based on the relationships among all items, potentially transcending the limitations of predefined dimensions.Using unsupervised methods, the most representative items for inherent groupings were retained.These items are subsequently utilized as features in supervised machine learning algorithms to predict BC classifications and scores.The dimensional structure and item formulation of the CCMQ are primarily designed based on the experiential insights of experts in TCM, lacking consistent structural validity supported by empirical data.As mentioned previously, supervised and unsupervised machine learning algorithms each offer distinct advantages in analyzing the relationships between the CCMQ dimensions and items.Supervised algorithms operate on the premise of acknowledging existing dimensions, while unsupervised algorithms do not.In the process of predicting BC classifications or scores with a limited number of items, comparing the selections made by these two approaches can help identify consistencies and discrepancies in item choice, which further reveals items that are crucial for the stability of the CCMQ dimensions.
In summary, this study illustrated the application of various machine learning methods to rapidly determine BC classifications and calculate BC scores.Specifically, we utilized an automated machine learning algorithms (AutoML) known as the Tree-based Pipeline Optimization Tool (TPOT), as well as unsupervised machine learning through variable clustering analysis (varclus).Using machine learning methods (i.e., model selection and parameter tuning), the predicted BC classifications or scores from a subset of core items are expected to be highly correlated with the original classifications or scores; thus, a subset of core items can be used instead of the whole set to improve the test efficacy of the CCMQ.

Data source and collection
Between August 26, 2015, and October 12, 2017, a survey was conducted via a web-based platform using the CCMQ to assess the BC of individuals aged 15-64 across China.A total of 94,718 questionnaires were collected.However, 3573 responses were discarded due to fictitious or inconsistent answers, yielding 91,145 valid questionnaires for analysis.

Measurement instrument
This study employed the 60-item adult version of the CCMQ developed by Wang et al. (2006).The questionnaire is organized into nine subscales, each containing 6-8 items.Questions are presented in a question format (e.g., "Were you energetic?")Responses are measured using a five-point Likert scale ranging from "1.Not at all" to "5.Very much." Specific items associated with the GTC subscale include the following: "(2) Did you become tired easily?", "(7) Was your voice weak when talking?", "(8) Did you feel in low spirits and depressed?","(21) Did you feel more vulnerable to the cold than others (winter coldness, air conditioners, fans, etc.)?", "(54) Did you easily experience insomnia?",and "(27) Did you forget things easily?" are reverse scored.In contrast, these items are positively scored in other BCs as well as the remaining items.The formula for calculating the converted score is as follows: BC types were classified based on the criteria established in the "Classification and Determination of Chinese Medicine Constitution" [18] .Specifically, the criterion for identifying GTC requires a minimum conversion score of 60, whereas the conversion scores for the remaining biased BCs are less than 30.For the eight biased BCs, a converted score above 40 indicates their presence, and scores ranging from 30 to 40 suggest a predisposition toward a specific constitution.In this study, the Cronbach's alpha coefficients for the subscales were as follows: GTC at 0.664, QDC at 0.742, YaDC at 0.778, YiDC at 0.673, PDC at 0.706, DHC at 0.649, BSC at 0.660, QSC at 0.775, and SDC at 0.721.

Supervised machine learning
AutoML was employed to select the optimal supervised machine learning algorithm for BC classifications and scores, facilitating the implementation of traditional machine learning model design strategies in an automated, data-driven manner.The Tree-based Pipeline Optimization Tool (TPOT), a prevalent AutoML algorithm, automatically designs and optimizes machine learning pipelines for specific problem domains without human intervention [19].TPOT primarily comprises two components: model construction via genetic programming and optimal model selection using Pareto efficiency.First, predictive models were built using all possible CCMQ item combinations as predictor variables, ranging from combinations with single item to all items, with the BC classifications and scores from the original subscales as the predicted outcomes.During the model construction process, new pipeline configurations are generated through crossover and mutation operations within the genetic algorithm.However, a challenge arises because maximizing model predictive performance (the model with the highest accuracy for classification and the model with the smallest mean squared error (MSE) for regression) often results in increased model complexity.By leveraging genetic programming in conjunction with Pareto optimality, the model selected by TPOT effectively balances predictive performance against complexity.Based on the model selected by the TPOT, the item combination that demonstrated the best performance across all item combinations with the same number of items was selected using the area under the curve (AUC) or R-squared (R 2 ).Then, given the best item combinations for a specific number of items, we may need to further determine the most appropriate number of items for predicting BC classifications or scores.With this aim, we calculated the AUC or R 2 improvement for the predictive performance of an item combination with at least two items over the item combination with one fewer item in predicting the original BC classifications or scores.In this study, the item combinations that show obvious improvements in AUC and R 2 are recommended for rapidly assessing BC classifications and scores.When the improvement progresses steadily, it is recommended to use a threshold of 0.8 for AUC and R 2 , which are generally considered indicators of excellent predictive performance [20,21], to determine the minimum number of items to be retained when predicting BC classifications or scores.Finally, the predictive performance will be assessed by AUC, accuracy, F1 score, R 2 , root mean square error (RMSE) and mean absolute percentage error (MAPE).

Unsupervised machine learning
Variable clustering analysis (varclus) was used in this study as the unsupervised machine learning method, which groups similar items into clusters.Each cluster can be represented by a single item.This approach reduces data complexity while ensuring interpretability.Based on the clusters, representative items are selected, and the scale items are simplified accordingly.
First, the 60 items of the CCMQ were normalized to form a single cluster.This cluster was then iteratively segmented, continuing until the second eigenvalue within a cluster did not exceed the threshold (shown in Fig. 1).In this study, the thresholds were dynamically adjusted to capture clusters of all possible quantities (i.e., 1-60).Next, representative items were selected from each cluster.The selection of representative items is based on the following formula: represents the jth variable's maximum loading on the first principal component of all other clusters, excluding the cluster to which it belongs.The item with the highest ratio is chosen as the representative item for a given cluster.
In other studies, the number of principal components is typically determined using a cumulative variance contribution rate of 70-90% [22].In this study, we adopt the average variance contribution rate at a median level within this range to determine the number of clusters.We selected the least number of clusters based on the criterion that the average within-cluster variance explained by the representative item is 80%.The items selected by varclus were used to predict BC classifications or scores by selecting the appropriate supervised machine learning methods using the TPOT algorithms, and their predictive capacities were evaluated using AUC, accuracy, and F1 score (or R 2 , RMSE, and MAPE).

Distribution of scores among different BCs
Figure 2 shows that the scores for the eight biased BCs exhibit a pronounced left-skewed distribution.Among the seven biased BCs (i.e., QDC, YaDC, YiDC, PDC, DHC, BSC, and QSC), there was a greater number of individuals in the 20-50 score range.

Item selection based on automated machine learning
The optimal performance of the supervised machine learning item selection models, utilizing BC classifications as the dependent variable across various item combinations, is presented in Table 1, with the corresponding Fig. 2 Distribution of scores for the BCs algorithms detailed therein.Figure 3 illustrates the improvement in the AUC for the optimal item combinations for predicting each BC classification.
For all models except the QSC model, the AUC plots revealed elbow points for either two or four items: GTC: item 2 (i.e.,tiredness; abbreviated form throughout; items, questions, and question abbreviations are provided in Table S1), item 8 (i.e., depression), item 21 (i.e., cold intolerance), item 27 (i.e., forgetfulness); QDC: item 3 (i.e., breathlessness), item 6 (i.e., quietude); YaDC: item 19 (i.e., cold aversion), item 52 (i.e., cold sensitivity); YiDC: item 20 (i.e., localized hotness), item 35 (i.e., dryness); PDC: item 49 (i.e., sticky mouth), item 50 (i.e., flabby abdomen); DHC: item 39 (i.e., oily skin), item 59 (i.e., urethral heat); BSC: item 40 (i.e., hyperpigmentation), item 43 (i.e., dark circles); and SDC: item 24 (i.e., chronic rhinitis), item 31 (i.e., urticaria).Remarkably, the QSC model achieved the predefined threshold of AUC = 0.8 with the inclusion of only one item (item 9: anxiety).The selected items achieved AUC values ranging from 0.857 to 0.946 (shown in Table 3).In general, predictive models are considered excellent when their AUC values fall between 0.8 and 0.9 and outstanding when they exceed 0.9 [21]; therefore, all these models Table 1 The optimal item combinations for BC classifications as the dependent variable and their corresponding algorithms The specific meanings represented by the entries are found in the supplementary materials demonstrated excellent predictive performance, with some even reaching outstanding levels.Additionally, the accuracy and F1 scores of these models' predictions were calculated, with accuracy values ranging from 0.819 to 0.936 and F1 scores ranging from 0.417 to 0.807.Table 2 presents the optimal performance of the supervised machine learning item selection models with the BC score as the dependent variable.Figure 4 demonstrates the improvement in R 2 performance for these optimal models.
In Fig. 4, a clear elbow point is shown for the YaDC model with a two-item combination (item 19: cold aversion and item 52: cold sensitivity).Other BC score predictive models did not exhibit significant elbow points.Accordingly, the item combinations for the final BC scores prediction were selected from the models with an R 2 exceeding 0.8 that used the fewest predictors.Specifically, the models for GTC, QDC, YiDC, PDC, DHC, and BSC included the first four items (GTC: item 2: tiredness, item 8: depression, item 21: cold intolerance, item 53: adaptability; QTC: item 3: breathlessness, item 5: dizziness, item 6: quietude, item 26: hyperhidrosis; YiDC: item 20: localized hotness, item 44: dry eyes, item 46: thirstiness, item 57: constipation; PDC: item 15: lethargy, item 28: oily T-zone, item 49: sticky mouth, item 50: flabby abdomen; DHC: item 39: oily skin, item 48: bitter mouth, item 56: sticky stools, item 60: wet scrotum/yellowing leukorrhea; BSC: item 27: forgetfulness, item 37: pain, item 40: hyperpigmentation, item 43: dark circles, respectively).Meanwhile, the QSC and SDC models incorporated the first three items (QSC: item 9: anxiety, item 10: vulnerability, item 14: sighing; SDC: item 24: chronic rhinitis, item 30: allergies, item 34: Fig. 3 AUC for different item combinations.Note.For the GTC, QDC, YaDC, YiDC, PDC, DHC, BSC and SDC, elbow points were selected where the item combination maximized the improvement in the AUC.For QSC, the curve was relatively smooth, so we selected the item combination with the fewest number of items when the AUC exceeded 0.8 dermatographism, respectively), aligning with this criterion.For predicting BC scores, the models yielded R 2 values ranging from 0.785 to 0.879 (shown in Table 3).
According to general guidelines that R 2 values of 0.75 and 0.50 are considered substantial and moderate, respectively, of predictive performance [20], all of Table 2 The optimal item combinations for the BC score as the dependent variable and their corresponding algorithms The specific meanings represented by the items are found in the supplementary materials

Item selection based on the unsupervised machine learning algorithm
For all possible numbers of clusters, the average variance contribution of all representative items within their respective clusters is illustrated in Fig. 5.When the number of clusters was set to 29, the representative items accounted for 80% of the variance within each cluster, on average.To further assess the predictive capability of these items for BC classification or score, the appropriate supervised machine learning models were constructed by TPOT, with these 29 items as independent variables and BC classifications or scores as dependent variables.The specific predictive algorithms and their corresponding performances are detailed in Table 4.
In terms of predicting BC classifications, the selected items achieved AUC values ranging from 0.847 to 0.965, prediction accuracies ranging from 0.820 to 0.950 and F1 scores from 0.294 to 0.858.For predicting BC scores, the models yielded R 2 values from 0.549 to 0.888, RMSE values from 5.602 to 12.135 and MAPE values from 14.470 to 48.239 (shown in Table 4).According to the criteria mentioned above [20,21], most models exhibited outstanding predictive performance, while a few were considered to have moderate predictive capability.Fig. 4 R 2 for different item combinations.For YaDC, an elbow point was selected where the item combination maximized the improvement in the AUC.For GTC, QDC, YiDC, PDC, DHC, BSC, QSC and SDC, the curves were relatively smooth, so we selected the item combination with the fewest number of items when the AUC exceeded 0.8
In the prediction of classifications, the items selected based on TPOT and those selected by varclus, using the appropriate prediction models, demonstrated similar AUCs, accuracies and F1 scores (Fig. 6B).However, except for the GTC model, TPOT selected fewer items for the biased BC classifications compared to varclus.

Table 3 Evaluation of the automated machine learning results based on the appropriate supervised machine learning methods
The specific meanings represented by the items are found in the supplementary materials

Subscale The BC classification as the target variable
The  In the prediction of scores for BC, such as GTC, PDC, DHC, QSC, and SDC, the items selected based on TPOT demonstrated notable advantages in comparison with those selected by varclus (Fig. 6C).However, in the prediction of scores for BC types such as QDC and YaDC, selections made by varclus and its corresponding predictive algorithms exhibited superior performance.For the YiDC and BSC score predictions, the performances of the items chosen by TPOT and varclus were similar.Overall, while the number of items selected by TPOT and varclus remains similar, the selections made by TPOT yield a higher average R 2 and lower average RMSE and MAPE.

Discussion
Self-report questionnaires are recognized in clinical practice as effective tools for quantifying abstract concepts, aiding in assessments of disease risk [23][24][25].As the use of questionnaires grows, there is an increased demand for them to possess strong predictive capabilities and to facilitate rapid disease determination.In this study, we utilized machine learning techniques for rapidly determining BCs.
The comparative analysis between AutoML and unsupervised machine learning in terms of item selection for BC classifications and scores revealed a slight advantage for AutoML.The reason might be that the predictive targets were predefined dimensions of the CCMQ.Thus, AutoML are recommended for predicting original BC classifications or scores.
Previous research simplified scales by calculating feature importance using supervised machine learning algorithms for predicting total scores with item scores and retaining items with high feature importance [16,17,36].However, this method faces the issue that different machine learning algorithms may assign items different weights [37].We may select the appropriate supervised machine learning algorithms before calculating feature importance, but this chosen algorithm may not perform best when using the selected items to predict the original scores.In this study, we innovatively selected the best-performing model from those built using all possible combinations of items as input variables.Thus, the absolute improvement in the models' predictive performance as the number of items used to predict the original BC classifications or scores increases can be calculated.For each possible number of items, the top three ranked item combinations in terms of predictive effectiveness, along with their corresponding algorithms, are listed in the supplemental material (Figure S1 and S2).Consequently, we can comprehensively understand the predictive capabilities of all possible item combinations and their corresponding algorithms and assist practitioners in selecting item combinations based on their understanding of TCM theories and specific needs in scale development (e.g., required reliability and validity, test efficiency).In summary, this study has significant implications for the BC identification in clinical practice.Firstly, the machine learning algorithms proposed in this study enable rapid BC identification based on a subset of items.Secondly, by comparing different supervised and unsupervised machine learning algorithms, it is possible to gain deeper insights into how different items contribute to the various BC dimensions, thereby assisting clinical practitioners in achieving a more thorough understanding of these dimensions.6 Comparison of prediction performance using the appropriate supervised machine learning for items selected based on TPOT and varclus.A The frequency of items selected based on TPOT and varclus.B Performance of items selected based on TPOT and varclus in predicting BC classifications using the appropriate supervised machine learning method.C Performance of items selected based on TPOT and varclus in predicting BC scores using the appropriate supervised machine learning method.In C, the RMSE measure is represented as RMSE/RMSEmax, and the MAPE is represented as MAPE/MAPEmax A major limitation of this study is the computational cost arising from the complexity of the algorithm combined with the large volume of data.Therefore, future research should focus on optimizing algorithms to enhance their processing speed and efficiency in big data environments.Also, in the future, we can further optimize the rapid body constitution identification process by integrating multimodal data, such as tongue and pulse diagnostics.

Conclusion
The items in the CCMQ were shown to have varying information weights.The use of highly important items may assist in the rapid determination of BCs.The use of supervised machine learning algorithms with all the possible item combinations for predicting BC classifications or scores achieved acceptable and stable predictive performance.The top item combinations obtained by supervised machine learning algorithms for predicting BC classifications or scores were identified so that other researchers can make selections according to their needs.
Converted Score = Raw Score − Number of Items Number of Items × 4 × 100

Fig. 5
Fig.5 The average variance contribution of the representative items

Fig.
Fig.6 Comparison of prediction performance using the appropriate supervised machine learning for items selected based on TPOT and varclus.A The frequency of items selected based on TPOT and varclus.B Performance of items selected based on TPOT and varclus in predicting BC classifications using the appropriate supervised machine learning method.C Performance of items selected based on TPOT and varclus in predicting BC scores using the appropriate supervised machine learning method.In C, the RMSE measure is represented as RMSE/RMSEmax, and the MAPE is represented as MAPE/MAPEmax

Table 4
Evaluation of varclus results based on the appropriate supervised machine learning methodsThe specific meanings represented by the items are found in the supplementary materials